System and method for the fusion of bottom-up whole-image features and top-down entity classification for accurate image/video scene classification

ABSTRACT

Described is a system and method for accurate image and/or video scene classification. More specifically, described is a system that makes use of a specialized convolutional neural network (hereafter CNN) based technique for the fusion of bottom-up whole-image features and top-down entity classification. When the two parallel and independent processing paths are fused, the system provides an accurate classification of the scene as depicted in the image or video.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation application of U.S. application Ser. No. 15/427,597, filed on Feb. 8, 2017, which is a non-provisional patent application of U.S. Application No. 62/293,321, filed on Feb. 9, 2016, the entireties of which are hereby incorporated by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under the U.S. Government's ONR NEMESIS project, Contract Number N00014-15-C-0091.

The government has certain rights in the invention.

BACKGROUND OF THE INVENTION

(1) Field of Invention

The present invention relates to a system and method for accurate image and video scene classification and, more specifically, to a system and method that makes use of a specialized convolutional neural network-based technique for the fusion of bottom-up whole-image features and top-down entity classification.

(2) Description of Related Art

Convolutional neural networks (CNNs) are the state of the art for entity and scene classification using whole image features on large image datasets. However, existing CNN systems do not explicitly use entity information in tasks where that could be useful. Because the pixels of the entities are in the whole image input to the CNN, entity features can be detected and integrated by later layers, such that the CNN alone to some extent serves as entity detector and whole image feature extractor at once (see the List of Incorporated Literature References, Literature Reference No. 3). This capability is limited when generalizing to scenes with entity relationships different from those seen in the training set.

Recent advances have modified CNNs to learn better spatial relationships between features (see Literature Reference No. 2) by combining a CNN with a recurrent neural network (RNN), unfolding the image features from a spatial two-dimensional (2-D) domain into a one-dimensional (1-D) time series suitable for an RNN. Because of the flexible structural relationship between entities and scenes, learning these spatial relationships will likely not perform as well as an independent entity recognition component. The unfolding is also done in fixed scanning patterns that would be arbitrary in images without a constrained orientation (e.g., aerial imagery).

Other methods for scene recognition using features designed by hand (e.g., SURF or HOG) or features from unsupervised learning (see Literature Reference No. 1) are somewhat effective, but generally outperformed by CNN-based methods.

Scene classification has been a long-standing topic of interest, as made obvious by the listed prior art. However, the idea of automatically using image data as input and letting the algorithm learn and extract discriminating features via a deep learning method that fuses those features with top-down information in a subsequent automated classification is not obvious and has not previously been applied to this topic.

While traditional neural networks have been widely used, these approaches require feature selection that must be done by humans and do not have the feature learning capability of deep learning.

Thus, a continuing need exists for a system that uses image data as input to let an algorithm learn and extract discriminating features via a deep learning method that fuses features and top-down information in a subsequent automated classification.

SUMMARY OF THE INVENTION

This disclosure provides a system for scene classification. In various embodiments, the system includes one or more processors and a memory. The memory is a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions, the one or more processors perform several operations, including operating at least two parallel, independent processing pipelines on an image or video to generate independent results; fusing the independent results of the at least two parallel, independent processing pipelines to generate a fused scene class; and electronically controlling machine behavior based on the fused scene class of the image or video.

In another aspect, the at least two parallel, independent processing pipelines includes an entity processing pipeline that uses a convolutional neural network (CNN) to identify a number and type of entities in the image or video, resulting in an entity feature space.

In yet another aspect, the entity processing pipeline identifies and segments potential object locations within the image or video and assigns a class label to each identified and segmented potential object within the image or video. In another aspect, the entity feature space includes a bag of words histogram feature.

In yet another aspect, the at least two parallel, independent processing pipelines includes a whole image processing pipeline that uses a convolutional neural network (CNN) to extract visual features from the whole image, resulting in a visual feature space.

Further, in fusing the independent results to generate the fused scene class, the visual feature space and entity feature space are combined into a single multi-dimensional combined feature, with a classifier trained on the combined feature generating the fused scene class.

In another aspect, in fusing the independent results to generate the fused scene class, two classifiers are trained separately for each of the visual feature space and entity feature space to generate independent class probability distributions over scene types, with the independent class probability distributions being combined to generate the fused scene class.

In yet another aspect, electronically controlling machine behavior includes at least one of labeling data associated with the image or video with the fused scene class, displaying the fused scene class with the image or video, controlling vehicle performance, or controlling processor performance.

In another aspect, the system displays the image or video with a label that includes the fused scene class.

Finally, the present invention also includes a computer program product and a computer implemented method. The computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors, such that upon execution of the instructions, the one or more processors perform the operations listed herein. Alternatively, the computer implemented method includes an act of causing a computer to execute such instructions and perform the resulting operations.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the present invention will be apparent from the following detailed descriptions of the various aspects of the invention in conjunction with reference to the following drawings, where:

FIG. 1 is a block diagram depicting the components of a system according to various embodiments of the present invention;

FIG. 2 is an illustration of a computer program product embodying an aspect of the present invention;

FIG. 3 is an illustration of a whole-image pipeline, wherein convolutional neural network features are extracted from the whole image;

FIG. 4 is an illustration of an entity pipeline, wherein entities are detected and classified using the Neovision2 method, and the entity class/frequency of occurrence is encoded as a bag-of-words histogram;

FIG. 5 is an illustration of feature-level fusion; and

FIG. 6 is an illustration of class-level fusion.

DETAILED DESCRIPTION

The present invention relates to a system and method for accurate image and video scene classification and, more specifically, to a system and method that makes use of a specialized convolutional neural network-based technique for the fusion of bottom-up whole-image features and top-down entity classification. The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications, will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of aspects. Thus, the present invention is not intended to be limited to the aspects presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

Before describing the invention in detail, first a list of incorporated references is provided. Next, a description of the various principal aspects of the present invention is provided. Subsequently, an introduction provides the reader with a general understanding of the present invention. Finally, specific details of various embodiments of the present invention are provided to give an understanding of the specific aspects.

(1) List of Incorporated Literature References

The following references are cited throughout this application. For clarity and convenience, the references are listed herein as a central resource for the reader. The following references are hereby incorporated by reference as though fully set forth herein. The references are cited in the application by referring to the corresponding literature reference number.

1. Min Fu, Yuan Yuan, and Xiaoqiang Lu. “Unsupervised Feature Learning for Scene Classification of High Resolution Remote Sensing Images.” IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 2015.

2. Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, and Yushi Chen. “Convolutional Recurrent Neural Networks: Learning Spatial Dependencies for Image Representation.” In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2015.

3. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Object Detectors Emerge in Deep Scene CNNs.” International Conference on Learning Representations (ICLR) oral, 2015.

4. U.S. Pat. No. 8,885,887, “System for object detection and recognition in videos using stabilization,” 2014.

5. U.S. Pat. No. 8,965,115, “Adaptive multi-modal detection and fusion in videos via classification-based-learning,” 2015.

6. U.S. Pat. No. 9,008,366, “Bio-inspired method of ground object cueing in airborne motion imagery,” 2015.

7. U.S. Pat. No. 9,111,355, “Selective color processing for vision systems that enables optimal detection and recognition,” 2015.

8. U.S. Pat. No. 9,165,208, “Robust ground-plane homography estimation using adaptive feature selection,” 2015.

9. Deepak Khosla, Yang Chen, and K. Kim. “A Neuromorphic System for Video Object Recognition.” Frontiers in Computational Neuroscience, 8:147, 2014.

(2) Principal Aspects

Various embodiments of the present invention include three “principal” aspects. The first is a system and method for accurate image and/or video scene classification and, more specifically, one that makes use of a specialized convolutional neural network (CNN) based technique for the fusion of bottom-up whole-image features and top-down entity classification. The system is typically in the form of a computer system operating software or in the form of a “hard-coded” instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities. The second principal aspect is a method, typically in the form of software, operated using a data processing system (computer). The third principal aspect is a computer program product. The computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape. Other, non-limiting examples of computer-readable media include hard disks, read-only memory (ROM), and flash-type memories. These aspects will be described in more detail below.

A block diagram depicting an example of a system (i.e., computer system 100) of the present invention is provided in FIG. 1. The computer system 100 is configured to perform calculations, processes, operations, and/or functions associated with a program or algorithm. In one aspect, certain processes and steps discussed herein are realized as a series of instructions (e.g., software program) that reside within computer readable memory units and are executed by one or more processors of the computer system 100. When executed, the instructions cause the computer system 100 to perform specific actions and exhibit specific behavior, such as described herein.

The computer system 100 may include an address/data bus 102 that is configured to communicate information. Additionally, one or more data processing units, such as a processor 104 (or processors), are coupled with the address/data bus 102. The processor 104 is configured to process information and instructions. In an aspect, the processor 104 is a microprocessor. Alternatively, the processor 104 may be a different type of processor such as a parallel processor, application-specific integrated circuit (ASIC), programmable logic array (PLA), complex programmable logic device (CPLD), or a field programmable gate array (FPGA).

The computer system 100 is configured to utilize one or more data storage units. The computer system 100 may include a volatile memory unit 106 (e.g., random access memory (“RAM”), static RAM, dynamic RAM, etc.) coupled with the address/data bus 102, wherein a volatile memory unit 106 is configured to store information and instructions for the processor 104. The computer system 100 further may include a non-volatile memory unit 108 (e.g., read-only memory (“ROM”), programmable ROM (“PROM”), erasable programmable ROM (“EPROM”), electrically erasable programmable ROM (“EEPROM”), flash memory, etc.) coupled with the address/data bus 102, wherein the non-volatile memory unit 108 is configured to store static information and instructions for the processor 104. Alternatively, the computer system 100 may execute instructions retrieved from an online data storage unit such as in “Cloud” computing. In an aspect, the computer system 100 also may include one or more interfaces, such as an interface 110, coupled with the address/data bus 102. The one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems. The communication interfaces implemented by the one or more interfaces may include wireline (e.g., serial cables, modems, network adaptors, etc.) and/or wireless (e.g., wireless modems, wireless network adaptors, etc.) communication technology.

In one aspect, the computer system 100 may include an input device 112 coupled with the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 104. In accordance with one aspect, the input device 112 is an alphanumeric input device, such as a keyboard, that may include alphanumeric and/or function keys. Alternatively, the input device 112 may be an input device other than an alphanumeric input device. In an aspect, the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 104. In an aspect, the cursor control device 114 is implemented using a device such as a mouse, a track-ball, a track-pad, an optical tracking device, or a touch screen. The foregoing notwithstanding, in an aspect, the cursor control device 114 is directed and/or activated via input from the input device 112, such as in response to the use of special keys and key sequence commands associated with the input device 112. In an alternative aspect, the cursor control device 114 is configured to be directed or guided by voice commands.

In an aspect, the computer system 100 further may include one or more optional computer usable data storage devices, such as a storage device 116, coupled with the address/data bus 102. The storage device 116 is configured to store information and/or computer executable instructions. In one aspect, the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., hard disk drive (“HDD”), floppy diskette, compact disk read only memory (“CD-ROM”), digital versatile disk (“DVD”)). Pursuant to one aspect, a display device 118 is coupled with the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics. In an aspect, the display device 118 may include a cathode ray tube (“CRT”), liquid crystal display (“LCD”), field emission display (“FED”), plasma display, or any other display device suitable for displaying video and/or graphic images and alphanumeric characters recognizable to a user.

The computer system 100 presented herein is an example computing environment in accordance with an aspect. However, the non-limiting example of the computer system 100 is not strictly limited to being a computer system. For example, an aspect provides that the computer system 100 represents a type of data processing analysis that may be used in accordance with various aspects described herein. Moreover, other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment. Thus, in an aspect, one or more operations of various aspects of the present technology are controlled or implemented using computer-executable instructions, such as program modules, being executed by a computer. In one implementation, such program modules include routines, programs, objects, components and/or data structures that are configured to perform particular tasks or implement particular abstract data types. In addition, an aspect provides that one or more aspects of the present technology are implemented by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where various program modules are located in both local and remote computer-storage media including memory-storage devices.

An illustrative diagram of a computer program product (i.e., storage device) embodying the present invention is depicted in FIG. 2. The computer program product is depicted as floppy disk 200 or an optical disk 202 such as a CD or DVD. However, as mentioned previously, the computer program product generally represents computer-readable instructions stored on any compatible non-transitory computer-readable medium. The term “instructions” as used with respect to this invention generally indicates a set of operations to be performed on a computer, and may represent pieces of a whole program or individual, separable, software modules. Non-limiting examples of “instruction” include computer program code (source or object code) and “hard-coded” electronics (i.e., computer operations coded into a computer chip). The “instruction” is stored on any non-transitory computer-readable medium, such as in the memory of a computer or on a floppy disk, a CD-ROM, and a flash drive. In either event, the instructions are encoded on a non-transitory computer-readable medium.

(3) Introduction

This disclosure provides a system for automatically classifying and labeling the scene type depicted in an image or video (e.g., roadways, urban, water-body, vegetation, etc.). A unique aspect of the method employed by the system is in combining information from an entity processing pipeline (top-down information) and a whole image processing pipeline (bottom-up information). The top-down pipeline uses a convolutional neural network (CNN) to identify the number and type of entities in the image. The bottom-up pipeline uses a CNN to extract visual features from the whole image. The entity description from the top-down pipeline is converted into an entity feature space that is combined with the visual feature space. A classifier is trained with supervised learning on the combined feature space and used to produce a scene type label.

Using entity information helps to supplement a CNN when generalizing it to a new task. While CNNs are the state of the art for object classification and scene recognition tasks in computer vision, there are some limitations to their generalizability to image sets other than those they were trained on. Typically, successful CNNs have several (e.g., 5-10) convolutional layers and fully connected neural network layers, and thus have hundreds of thousands to tens of millions of parameters. Assigning values to all these parameters involves training the network in a supervised fashion on very large labelled image datasets. When the image dataset contains a broad enough variety of image classes, the early convolutional layers in the network learn basic image features like edges, which can easily generalize to other image sets, while the later stage layers learn mappings from the midlevel features to the class labels particular to the training set. To take such a pre-trained CNN and apply it to a specific problem (e.g., UAV scouting), the later layers can be removed and a new final layer trained on a smaller supervised training set for the new classification task. Combining the CNN output with information about the entities detected in the image is a way to further adapt the CNN to the new task. For practical applications like autonomous driving or UAV surveillance, entity detection may already be part of the control software architecture and trained for the entities that are the most important for the application. Entities are informative about what scene they occur in. For example, airplanes suggest airfields, cars suggest roads, etc. The entities in a scene can help discriminate scene classes that look similar at the whole-image feature level (e.g., a runway on an airfield versus a lane on a road).
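
For illustration only, the following is a minimal sketch of this final-layer replacement in Python using PyTorch/torchvision. The VGG-16 backbone, the pre-trained weight set, and the seven scene classes (taken from the reduction-to-practice section below) are assumptions of the sketch, not the specific network of the invention.

# Minimal sketch: adapt a pre-trained CNN by replacing its final layer.
# Assumes PyTorch and torchvision >= 0.13; the VGG-16 backbone is illustrative.
import torch.nn as nn
from torchvision import models

NUM_SCENE_CLASSES = 7  # airfield, beach, industrial, natural, ocean, roads, urban

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

# Freeze the earlier layers, which are assumed to hold generalizable features.
for param in model.parameters():
    param.requires_grad = False

# Replace the final class-label layer; only this new head is then trained,
# with supervision, on the smaller dataset for the new classification task.
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features,
                                 NUM_SCENE_CLASSES)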

As can be appreciated by those skilled in the art, there are many practical applications for the system as described herein. For example, the system can be employed in autonomous driving, wherein entities (such as pedestrians, cyclists, vehicles, etc.) are being detected and the scene classification (e.g., freeway vs. parking lot vs. dirt road) could provide context for various control modes. For example, if the system determines an automobile is on a dirt road, it may automatically adjust the suspension of the vehicle, such as releasing air from the tires or loosening the shocks or suspension. Another application is scene recognition for UAV surveillance and autonomous vehicles. For example, if a UAV is tasked with tracking a vehicle, it may limit its computing power to tracking only when a scene is determined to be one in which vehicles may be tracked, such as on roadways, whereas when flying over a forest the vehicle tracking operations may be disabled. Other applications include oil and gas exploration, automatic image/video classification for a variety of uses, forensics, etc. Thus, as can be appreciated, there are many applications in which scene classification can be employed by a number of systems. Specific details regarding the system are provided below.

(4) Specific Details of Various Embodiments

The architecture of the system described herein combines information from two parallel, independent processing paths: one for the whole images (bottom-up) and one for the entities (top-down). For clarity, each of the processing paths is described below.

(4.1) Whole-Image Pipeline (Bottom-Up Information)

In the whole-image pipeline, and as shown in FIG. 3, whole image features 310 are extracted from a whole image 300 with a deep convolutional neural network (CNN) 302. The deep CNN 302 is formed of several layers performing filter convolution 304, spatial pooling 306, and nonlinear rectification. The “fully connected” layers 308 are a type of convolution layer that are usually placed at the end of the network. Rectification is not explicitly drawn in the figure, as it is a simple function applied on a per-element basis after each pooling layer. The network 302 has been trained by supervised learning on a large image dataset with class labels. The class labels used in training the network do not have to correspond to the classes of scenes to be classified by the whole architecture, as it is expected that most of the earlier parts of the network are learning generalizable, low-level visual features. The final layer of the network that assigns a class label to the final feature space can be replaced with another classifier that is later trained on the scene classes. For example, this last layer can be replaced with a linear support vector machine or any other suitable classifier. A class label is generated from the extracted features 310 either via 500 (in combination with 404 and as shown in FIG. 5) in feature-level fusion or via 604 in class-level fusion (as shown in FIG. 6).

For further understanding, upon receiving a whole image 300 (from a sensor (e.g., camera), database, video stream, etc.), the system performs convolution 304 by convolving the image 300 with various filters. Each filter convolution generates a 2D “image” of network activations. Pooling 306 is then performed on those images of activations, resulting in smaller images. A rectification function is applied at each element of the pooled images. The rectified activations can then be the input to the next convolution layer, and so on. After some number of these convolution-pooling-rectification stages, the activations are input to a fully-connected layer 308, a type of neural network layer where each output is a function of all the inputs. It can be thought of also as a convolution layer with filters the same size as the input. One or two fully-connected layers 308 are typically used at the end of the convolutional neural network. The output of the final fully-connected layer 308 is the extracted feature 310.
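
Continuing the sketch above (again an illustration, not the specific network of the invention), the truncated model can serve as the whole-image feature extractor; the 224x224 input size and normalization constants are the standard ImageNet values assumed by torchvision's pre-trained models.

# Minimal sketch: use the truncated CNN as a whole-image feature extractor.
# Continues the VGG-16 sketch above.
import torch
from torchvision import transforms
from PIL import Image

# Standard ImageNet preprocessing assumed by torchvision's pre-trained models.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_whole_image_features(model, image_path):
    """Run the convolution-pooling-rectification stages and all but the
    final fully-connected layer; the last retained FC output is the feature."""
    model.eval()  # disable dropout so the extracted features are deterministic
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    extractor = torch.nn.Sequential(
        model.features,                # convolution / pooling / ReLU stages
        model.avgpool,
        torch.nn.Flatten(),
        *list(model.classifier)[:-1],  # drop the final class-label layer
    )
    with torch.no_grad():
        return extractor(x).squeeze(0)  # e.g., a 4096-dimensional vector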

(4.2) Entity Pipeline (Top-Down Information)

As shown in FIG. 4, the entity processing pipeline scans the image 300, identifies and segments potential object locations 400, and assigns an entity class label 402 to each potential object. The entity classifier can also be a CNN whose last layer has been trained on the entity types of interest. A non-limiting example of such an entity processing pipeline was described by the inventors in Literature Reference Nos. 4 through 9, which are incorporated herein by reference. Thus, for each image 300, the entity processing pipeline produces a list of all entities in the image and their types (i.e., class labels 402). In one example embodiment, this is encoded into a bag-of-words histogram feature 404. This feature 404 has a number of dimensions equal to the number of entity classes, and the value at each dimension is the frequency (number) of such entities detected.
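
As a minimal illustration of this encoding (the ten entity classes are those of the reduction-to-practice section below; the function name is hypothetical):

# Minimal sketch: encode detected entities as a bag-of-words histogram.
import numpy as np

ENTITY_CLASSES = ["boat", "bus", "car", "container", "cyclist",
                  "helicopter", "person", "plane", "tractor-trailer", "truck"]
CLASS_INDEX = {name: i for i, name in enumerate(ENTITY_CLASSES)}

def entity_bag_of_words(detected_labels):
    """One dimension per entity class; the value at each dimension is the
    number of entities of that class detected in the image."""
    hist = np.zeros(len(ENTITY_CLASSES), dtype=np.float32)
    for label in detected_labels:
        hist[CLASS_INDEX[label]] += 1.0
    return hist

# Example: two cars and one person yield counts at the "car" and "person" bins:
# entity_bag_of_words(["car", "car", "person"]) -> [0, 0, 2, 0, 0, 0, 1, 0, 0, 0]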

Other embodiments that encode both the spatial and frequency information are also possible. For example, in another embodiment, the image can be divided into N grids and the bag-of-words histogram can be generated for each grid. The information from all grids is then combined by concatenating each grid's bag-of-words into a single bag-of-words histogram. This preserves spatial information that can be helpful in scene classification. For example, with a fixed camera orientation such as in autonomous driving, roads are at the bottom of the scene, then vehicles/pedestrians/street signs, then trees, and the horizon/sky is at the top. This spatial structure is encoded in such an embodiment and can be used in scene classification, as shown in the sketch below.
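
Continuing the previous sketch, a minimal version of this grid variant follows; the 2x2 grid and the (label, x, y) detection format are assumptions for illustration.

# Minimal sketch: per-grid histograms concatenated to keep coarse spatial info.
def spatial_bag_of_words(detections, image_w, image_h, n_rows=2, n_cols=2):
    """detections: iterable of (label, x, y) entity centers in pixels.
    Returns the concatenation of one bag-of-words histogram per grid cell,
    preserving coarse layout (e.g., road at the bottom, sky at the top)."""
    cells = [[] for _ in range(n_rows * n_cols)]
    for label, x, y in detections:
        row = min(int(y / image_h * n_rows), n_rows - 1)
        col = min(int(x / image_w * n_cols), n_cols - 1)
        cells[row * n_cols + col].append(label)
    return np.concatenate([entity_bag_of_words(c) for c in cells])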

(4.3) Fusion

As noted above, the two processing paths run in parallel. The information from the two processing paths is fused either at the feature level or at the class probability level.

In feature-level fusion, and as shown in FIG. 5, the two feature spaces are simply concatenated. For example, the whole image features 310 (e.g., a 4096-dimensional image feature vector) and bag-of-words features 404 (e.g., a 10-dimensional entity bag-of-words feature) are combined into one multi-dimensional feature 500 (e.g., a 4106-dimensional feature). Then a classifier (e.g., support vector machine, etc.) is trained on the combined feature space 500 to generate the resulting fused scene class 502 (which includes the class label and score (i.e., probability)). In other embodiments, weighted or biased fusion of these features can be done.
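
A minimal sketch of feature-level fusion follows, assuming scikit-learn for the classifier; the variable names and the linear support vector machine are illustrative choices (the disclosure permits any suitable classifier).

# Minimal sketch: feature-level fusion by concatenation, then a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def fuse_features(cnn_features, entity_histogram):
    """e.g., a 4096-d whole-image feature + a 10-d entity histogram -> 4106-d."""
    return np.concatenate([cnn_features, entity_histogram])

# Training: X_fused stacks one fused vector per training image; y holds the
# scene labels.
# classifier = LinearSVC().fit(X_fused, y)
# Inference on a new image:
# scene = classifier.predict(fuse_features(f_image, f_entities).reshape(1, -1))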

In class-level fusion, and as shown in FIG. 6, information is combined at the class probability level such that the scene class 600 from the entity pipeline is fused with the scene class 602 from the whole image pipeline. More specifically, two classifiers 604 and 604′ are trained separately for the entities and whole image features. Each classifier 604 and 604′ produces a class probability distribution over the scene types (i.e., scene classes 600 and 602). These distributions are combined 606 (e.g., multiplied and renormalized) to produce the final classifier result 608 (fused scene class).
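
A minimal sketch of the multiply-and-renormalize combination follows (illustrative; other combination rules fall within the disclosure).

# Minimal sketch: class-level fusion by elementwise product and renormalization.
import numpy as np

def fuse_class_probabilities(p_entity, p_whole_image):
    """Both inputs are class probability distributions over the same scene
    types, one from each pipeline's separately trained classifier."""
    fused = np.asarray(p_entity) * np.asarray(p_whole_image)
    return fused / fused.sum()  # renormalize so the fused values sum to 1

# The fused scene class is the most probable scene type:
# scene_index = int(np.argmax(fuse_class_probabilities(p_ent, p_img)))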

Based on the fused scene class, a number of actions can be taken by the associated system. For example, the system can electronically control machine behavior based on the fused scene class of the image or video, such as labeling data associated with the image or video with the fused scene class, displaying the fused scene class with the image or video, controlling vehicle performance (such as causing a mobile platform (e.g., vehicle, drone, UAV, etc.) to move or maneuver to or away from an identified scene class (such as away from a building, or to a person, etc.)), or controlling processor performance (e.g., increasing processing speed when the class is a busy road, yet decreasing speed when the class is an open ocean to conserve processing capacity and power, etc.). As another example, the system can be caused to display the image or video (on a display) with a label that includes the fused scene class.

(4.4) Reduction-To-Practice

To demonstrate the effectiveness of the system described herein, the present invention has been reduced to practice and used to evaluate a dataset that provides both scene labels and entity labels as ground truth: the University of Southern California heli dataset, which is a set of aerial videos of Los Angeles. Entities were ground-truth labelled at the frame level for 10 classes, as follows: boat, bus, car, container, cyclist, helicopter, person, plane, tractor-trailer, and truck. Images were given scene labels from the classes as follows: airfield, beach, industrial, natural, ocean, roads, and urban. In this example, images with a mixture of scenes or transitioning between scenes were not used. Note that while the entity ground-truth was used for this evaluation, the system can be employed just as easily using the Neovision2 processing results of entity detection/classification (see Literature Reference Nos. 4-9). The main reason for using ground-truth is that it is generally more accurate than the Neovision2 results, and a desired purpose of the present invention is to show the benefits of fusing the entity information with whole image features.

Four methods of scene classification were compared, as follows: entities only, CNN features only, feature-level fusion, and class-level fusion. These methods achieved mean test classification accuracies of 50.4%, 65.7%, 82.2%, and 83.0%, respectively. This shows the benefit of fusion of the two pipelines for scene classification.

Finally, while the present invention has been described in terms of several embodiments, one of ordinary skill in the art will readily recognize that the present invention may have other applications in other environments. It should be noted that many embodiments and implementations are possible. Further, the following claims are in no way intended to limit the scope of the present invention to the specific embodiments described above. In addition, any recitation of “means for” is intended to evoke a means-plus-function reading of an element and a claim, whereas any elements that do not specifically use the recitation “means for” are not intended to be read as means-plus-function elements, even if the claim otherwise includes the word “means”. Further, while particular method steps have been recited in a particular order, the method steps may occur in any desired order and fall within the scope of the present invention.

What is claimed is:
1. A system for scene classification, the system comprising: one or more processors and a memory, the memory being a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions, the one or more processors perform operations of: operating at least two parallel, independent processing pipelines on a whole image to generate independent results, wherein the at least two parallel, independent processing pipelines includes both an entity processing pipeline and a whole image processing pipeline, wherein the entity processing pipeline operates on the whole image and uses a convolutional neural network (CNN) which scans the whole image to identify a number and type of entities in the whole image, resulting in an entity feature space, and wherein the whole image processing pipeline uses a CNN to extract visual features from the whole image, resulting in a visual feature space; fusing the independent results of the at least two parallel, independent processing pipelines to generate a fused scene class, wherein in fusing the independent results to generate the fused scene class, the visual feature space and the entity feature space are combined into a single multi-dimensional combined feature, with a classifier trained on the combined feature generating the fused scene class; and electronically controlling machine behavior based on the fused scene class of the whole image.
2. The system as set forth in claim 1, wherein the entity processing pipeline identifies and segments potential object locations within the whole image and assigns a class label to each identified and segmented potential object within the whole image.
3. The system as set forth in claim 1, wherein the entity feature space includes a bag of words histogram feature.
4. The system as set forth in claim 1, wherein electronically controlling machine behavior includes at least one of labeling data associated with the whole image with the fused scene class, displaying the fused scene class with the whole image, controlling vehicle performance, or controlling processor performance.
5. The system as set forth in claim 1, further comprising an operation of displaying the whole image with a label that includes the fused scene class.
6. A computer program product for scene classification, the computer program product comprising: a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions by one or more processors, the one or more processors perform operations of: operating at least two parallel, independent processing pipelines on a whole image to generate independent results, wherein the at least two parallel, independent processing pipelines includes both an entity processing pipeline and a whole image processing pipeline, wherein the entity processing pipeline operates on the whole image and uses a convolutional neural network (CNN) which scans the whole image to identify a number and type of entities in the whole image, resulting in an entity feature space, and wherein the whole image processing pipeline uses a CNN to extract visual features from the whole image, resulting in a visual feature space; fusing the independent results of the at least two parallel, independent processing pipelines to generate a fused scene class, wherein in fusing the independent results to generate the fused scene class, the visual feature space and the entity feature space are combined into a single multi-dimensional combined feature, with a classifier trained on the combined feature generating the fused scene class; and electronically controlling machine behavior based on the fused scene class of the whole image.
7. The computer program product as set forth in claim 6, wherein the entity processing pipeline identifies and segments potential object locations within the whole image and assigns a class label to each identified and segmented potential object within the whole image.
8. The computer program product as set forth in claim 6, wherein the entity feature space includes a bag of words histogram feature.
9. The computer program product as set forth in claim 6, wherein electronically controlling machine behavior includes at least one of labeling data associated with the whole image with the fused scene class, displaying the fused scene class with the whole image, controlling vehicle performance, or controlling processor performance.
10. The computer program product as set forth in claim 6, further comprising an operation of displaying the whole image with a label that includes the fused scene class.
11. A computer implemented method for scene classification, the method comprising an act of: causing one or more processors to execute instructions encoded on a non-transitory computer-readable medium, such that upon execution, the one or more processors perform operations of: operating at least two parallel, independent processing pipelines on a whole image to generate independent results, wherein the at least two parallel, independent processing pipelines includes both an entity processing pipeline and a whole image processing pipeline, wherein the entity processing pipeline operates on the whole image and uses a convolutional neural network (CNN) which scans the whole image to identify a number and type of entities in the whole image, resulting in an entity feature space, and wherein the whole image processing pipeline uses a CNN to extract visual features from the whole image, resulting in a visual feature space; fusing the independent results of the at least two parallel, independent processing pipelines to generate a fused scene class, wherein in fusing the independent results to generate the fused scene class, the visual feature space and the entity feature space are combined into a single multi-dimensional combined feature, with a classifier trained on the combined feature generating the fused scene class; and electronically controlling machine behavior based on the fused scene class of the whole image.
12. The method as set forth in claim 11, wherein the entity processing pipeline identifies and segments potential object locations within the whole image and assigns a class label to each identified and segmented potential object within the whole image.
13. The method as set forth in claim 11, wherein the entity feature space includes a bag of words histogram feature.
14. The method as set forth in claim 11, wherein electronically controlling machine behavior includes at least one of labeling data associated with the whole image with the fused scene class, displaying the fused scene class with the whole image, controlling vehicle performance, or controlling processor performance.
15. The method as set forth in claim 11, further comprising an operation of displaying the whole image with a label that includes the fused scene class.